library(tidyverse)
library(plotly)
data <- read_csv('./gapminder_clean.csv') %>%
select(-1) %>%
rename(
co2em = `CO2 emissions (metric tons per capita)`,
popden = `Population density (people per sq. km of land area)`,
lifeExp = `Life expectancy at birth, total (years)`,
energy = `Energy use (kg of oil equivalent per capita)`,
imports = `Imports of goods and services (% of GDP)`,
)
data1962 <- data %>%
filter(Year == 1962) %>%
select(gdpPercap, co2em) %>%
drop_na()
ggplot(data = data1962) +
geom_point(mapping = aes(
x = gdpPercap,
y = co2em)) +
labs(x = "GDP per capita", y = "CO2 emissions per capita (metric tons)")
cor.test(data1962 %>% pull(gdpPercap), data1962 %>% pull(co2em))
##
## Pearson's product-moment correlation
##
## data: data1962 %>% pull(gdpPercap) and data1962 %>% pull(co2em)
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8934697 0.9489792
## sample estimates:
## cor
## 0.9260817
corrs <- data %>%
group_by(Year) %>%
select(Year, gdpPercap, co2em) %>%
drop_na() %>%
summarise(correlation = cor(gdpPercap, co2em))
maxi <- lapply(corrs, max)
The strongest correlation is 0.9387918 in the year 2007.
max_em_year_data <- data %>%
filter(Year == maxi$Year) %>%
select(gdpPercap, co2em, pop, continent, `Country Name`) %>%
drop_na()
fig <- ggplot(data = max_em_year_data) +
geom_point(aes(
x = gdpPercap,
y = co2em,
size = pop,
color = continent,
text = paste("Country: ", `Country Name`,
"\nGDP: ", gdpPercap,
"\nCO2 emissions: ", co2em))) +
xlab("GDP per capita") +
ylab("CO2 emissions per capita (metric tons)") +
ggtitle(str_glue("GDP vs CO2 emissions per capita in ", maxi$Year))
ggplotly(fig, tooltip = "text")
For sections requiring a statistics test, I use the standard alpha of 0.95.
Let’s get a general idea of what we’re looking at. We filter our data to only see the continents and energy usages, and we create a frequency graph of it.
data_energy <- data %>%
select(continent, energy) %>%
drop_na()
ggplot(data_energy, mapping = aes(energy, color = continent)) +
geom_freqpoly(binwidth = 500)
It looks like we’ve got peaks at the same location for Africa, the Americas and Asia. Europe and Oceania are doing their own thing.
The next question is can we perform ANOVA and statistically determine relationships in the data. First, we check the assumptions.
Let’s observe how normal the data are.
ggplot(data_energy, mapping = aes(sample = energy)) +
facet_grid(continent ~ .) +
stat_qq() +
stat_qq_line()
Africa, Oceania and Europe look approximately normal, but the others do not. Let’s apply a data transformation. We get an empty tibble when filtering for energy values less than or equal to zero. So, it is safe to apply the logarithm to the energy usage data.
data_energy %>%
filter(energy <= 0)
## # A tibble: 0 x 2
## # … with 2 variables: continent <chr>, energy <dbl>
Let’s try the data transformation.
data_energy_t <- data_energy %>%
mutate(energy_transformed = log(energy))
ggplot(data_energy_t, mapping = aes(sample = energy_transformed)) +
facet_grid(continent ~ .) +
stat_qq() +
stat_qq_line()
Oceania remains normal. The other continents’ data look more normal now. They still are not perfect, but it’s okay since they have relatively large sample sizes, as the table below illustrates.
data_energy_t %>%
group_by(continent) %>%
summarise(count = n())
## # A tibble: 5 x 2
## continent count
## <chr> <int>
## 1 Africa 199
## 2 Americas 188
## 3 Asia 185
## 4 Europe 256
## 5 Oceania 20
The data come from different continents and are at least 5 years apart. Due to this, assume the data are independent and identically distributed.
How do the variances look?
ggplot(data_energy_t) +
geom_boxplot(mapping = aes(energy_transformed, continent)) +
xlab("Log of energy usage (kg of oil equivalent per capita)") +
ylab("Continent")
The variances of the transformed data look roughly the same. Oceania stands out as having a smaller variance than the others. Asia looks like it has a larger variance.
test_results <- aov(energy_transformed ~ continent, data = data_energy_t)
summary.aov(test_results)
## Df Sum Sq Mean Sq F value Pr(>F)
## continent 4 342.8 85.70 117.7 <2e-16 ***
## Residuals 843 613.9 0.73
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is less than 0.001. In our assumptions, we saw that the transformed data were not exactly normal, and their variances were not exactly equal. Taking that into account, I still think we can reject the null hypothesis. That is, we can reject that the log of the group means are equal. Since the logarithm is a monotone increasing function, we can reject that the group means are equal.
TukeyHSD(test_results)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = energy_transformed ~ continent, data = data_energy_t)
##
## $continent
## diff lwr upr p adj
## Americas-Africa 0.62609440 0.3888318 0.8633570 0.0000000
## Asia-Africa 0.53390365 0.2956539 0.7721534 0.0000000
## Europe-Africa 1.61347063 1.3930062 1.8339350 0.0000000
## Oceania-Africa 1.96990907 1.4226918 2.5171263 0.0000000
## Asia-Americas -0.09219075 -0.3337751 0.1493937 0.8351479
## Europe-Americas 0.98737623 0.7633124 1.2114401 0.0000000
## Oceania-Americas 1.34381467 0.7951374 1.8924920 0.0000000
## Europe-Asia 1.07956698 0.8544581 1.3046759 0.0000000
## Oceania-Asia 1.43600542 0.8869005 1.9851103 0.0000000
## Oceania-Europe 0.35643844 -0.1851867 0.8980636 0.3747658
For every pair except for (Asia, Americas) and (Oceania, Europe), the p-value is less than 0.0001, so for those, we reject the null hypothesis that the logarithm of the energy usage means are the same.
We filter our data to imports of goods and services in Europe and Asia. Let’s take a look at a histogram.
data_imports <- data %>%
filter(Year >= 1990 & continent %in% c('Europe', 'Asia')) %>%
select(continent, imports) %>%
drop_na()
ggplot(data = data_imports) +
geom_histogram(
mapping = aes(x = imports, fill = continent),
position = 'identity',
alpha = 0.5,
binwidth = 10,
)
The peaks seem to line up, but Europe as a bunch more data points near the peaks, compared to Asia. At a glance, I would guess that the two continents have roughly the same imports of goods and services, measured in percent of GDP. Can we perform an unpaired t-test to check our guess? Let’s check the assumptions.
We have histograms above, but I find Q-Q plots easier to look at.
ggplot(data_imports, mapping = aes(sample = imports)) +
facet_grid(continent ~ .) +
stat_qq() +
stat_qq_line()
Let’s do a data transformation.
data_imports_t <- data_imports %>%
mutate(imports_t = sqrt(imports))
ggplot(data_imports_t, mapping = aes(sample = imports_t)) +
facet_grid(continent ~ .) +
stat_qq() +
stat_qq_line()
The transformation helps a bit, but the data at the ends do not fit the lines well. Let’s proceed skeptically, assuming that the data are approximately normal.
Europe and Asia are completely different regions of land. Assume each datum comes from only one of the continents.
We use the time argument again. The data are taken 5 years apart, so for each group of data, assume that the data are independent and identically distributed.
ggplot(data_imports_t) +
geom_boxplot(mapping = aes(imports_t, continent)) +
xlab("Square root of imports of goods and services (percent of GDP)") +
ylab("Continent")
We have our eyes on Welch’s t-test, which is robust against unequal variances. For a normal unpaired t-test, we’d still be fine because the sample sizes are roughly equivalent.
data_imports_t %>%
group_by(continent) %>%
summarise(count = n())
## # A tibble: 2 x 2
## continent count
## <chr> <int>
## 1 Asia 98
## 2 Europe 114
We see outliers on the Asia data, which may affect our results negatively. We continue anyways, acknowledging these shortcomings.
x <- data_imports_t %>% filter(continent == 'Europe')
y <- data_imports_t %>% filter(continent == 'Asia')
t.test(x$imports, y$imports)
##
## Welch Two Sample t-test
##
## data: x$imports and y$imports
## t = -1.3552, df = 137.53, p-value = 0.1776
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -12.433240 2.321099
## sample estimates:
## mean of x mean of y
## 41.78924 46.84531
Our p-value is 0.1776. I’m skeptical because we did not satisfy some of our parametric assumptions as much as I wanted. We run the nonparametric Mann-Whitney test for assurance.
wilcox.test(x$imports, y$imports)
##
## Wilcoxon rank sum test with continuity correction
##
## data: x$imports and y$imports
## W = 5465, p-value = 0.7867
## alternative hypothesis: true location shift is not equal to 0
Our p-value here is 0.7867. Both tests indicate that we can retain the null hypothesis. However the null hypothesis of the parametric and nonparametric tests are different. In this case, I will retain the null hypothesis of the parametric test because we have decent results, and the statement is stronger.
That is, we retain the null hypothesis that square root of import means are the same. Because the square root function is monotone increasing, this is the same as saying the import means between Europe and Asia are the same.
data_popden <- data %>%
group_by(`Country Name`) %>%
select(`Country Name`, popden) %>%
summarise(avg_popden = mean(popden, na.rm = TRUE)) %>%
arrange(desc(avg_popden))
num_countries_shown <- 20
ggplot(data = head(data_popden, n = num_countries_shown)) +
geom_bar(
mapping = aes(x = avg_popden, y = reorder(`Country Name`, avg_popden)),
stat = "identity") +
xlab("Average population density (people per sq. km of land)") +
ylab("") +
ggtitle(str_glue(num_countries_shown, " most population dense countries 1962-2007"))
The country with the highest average population density between 1962 and 2007 is Macao SAR, China.
data_lifeExp <- data %>%
select(`Country Name`, Year, lifeExp) %>%
drop_na() %>%
group_by(`Country Name`) %>%
filter(any(Year == 1962)) %>%
mutate(lifeExpSince1962 = lifeExp - lifeExp[Year == 1962]) %>%
ungroup()
countries_highest <- data_lifeExp %>%
group_by(`Country Name`) %>%
mutate(lifeExpTotalChange = lifeExp[Year == 2007] - lifeExp[Year == 1962]) %>%
summarise(lifeExpChange = max(lifeExpTotalChange)) %>%
arrange(desc(lifeExpChange))
cutoff_lifeExpChange <- min(head(countries_highest, n = 10)$lifeExpChange)
data_lifeExpTop <- data_lifeExp %>%
group_by(`Country Name`) %>%
filter(lifeExp[Year == 2007] - lifeExp[Year == 1962] >= cutoff_lifeExpChange)
fig <- ggplot(data = data_lifeExpTop) +
geom_line(mapping = aes(
x = Year,
y = lifeExpSince1962,
color = `Country Name`)
) +
ylab("Change in life expectancy at birth since 1962 (years)")
ggplotly(fig)
The Maldives had the highest increase in life expectancy at birth from 1962 to 2007.